{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Visualizing and Analyzing Jigsaw" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import re\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous section, we explored how to generate topics from a textual dataset using LDA. But how can this be used as an application? \n", "\n", "Therefore, in this section, we will look into the possible ways to read the topics as well as understand how it can be used." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will now import the preloaded data of the LDA result that was achieved in the previous section. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"https://raw.githubusercontent.com/dudaspm/LDA_Bias_Data/main/topics.csv\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0Topic 0 wordsTopic 0 weightsTopic 1 wordsTopic 1 weightsTopic 2 wordsTopic 2 weightsTopic 3 wordsTopic 3 weightsTopic 4 words...Topic 5 wordsTopic 5 weightsTopic 6 wordsTopic 6 weightsTopic 7 wordsTopic 7 weightsTopic 8 wordsTopic 8 weightsTopic 9 wordsTopic 9 weights
00trump3452.3mental3351.9canada591.5mental1186.5gun...school840.5mental1058.1white1220.1mental1836.1god954.9
11presid1031.5ill1993.1muslim582.0peopl708.3mental...kid723.0comment848.3peopl1076.2peopl1793.0one934.0
22vote813.8health1213.7countri539.3drug555.8peopl...year590.5like678.6black651.0health1464.6women905.2
33like780.9medic706.8us519.8ill538.9law...go514.7would668.2disord537.1homeless1367.5life830.1
44elect579.5http630.5world490.3health497.7kill...time507.9think650.4person529.5care1296.8peopl798.2
\n", "

5 rows × 21 columns

\n", "
" ], "text/plain": [ " Unnamed: 0 Topic 0 words Topic 0 weights Topic 1 words Topic 1 weights \\\n", "0 0 trump 3452.3 mental 3351.9 \n", "1 1 presid 1031.5 ill 1993.1 \n", "2 2 vote 813.8 health 1213.7 \n", "3 3 like 780.9 medic 706.8 \n", "4 4 elect 579.5 http 630.5 \n", "\n", " Topic 2 words Topic 2 weights Topic 3 words Topic 3 weights Topic 4 words \\\n", "0 canada 591.5 mental 1186.5 gun \n", "1 muslim 582.0 peopl 708.3 mental \n", "2 countri 539.3 drug 555.8 peopl \n", "3 us 519.8 ill 538.9 law \n", "4 world 490.3 health 497.7 kill \n", "\n", " ... Topic 5 words Topic 5 weights Topic 6 words Topic 6 weights \\\n", "0 ... school 840.5 mental 1058.1 \n", "1 ... kid 723.0 comment 848.3 \n", "2 ... year 590.5 like 678.6 \n", "3 ... go 514.7 would 668.2 \n", "4 ... time 507.9 think 650.4 \n", "\n", " Topic 7 words Topic 7 weights Topic 8 words Topic 8 weights \\\n", "0 white 1220.1 mental 1836.1 \n", "1 peopl 1076.2 peopl 1793.0 \n", "2 black 651.0 health 1464.6 \n", "3 disord 537.1 homeless 1367.5 \n", "4 person 529.5 care 1296.8 \n", "\n", " Topic 9 words Topic 9 weights \n", "0 god 954.9 \n", "1 one 934.0 \n", "2 women 905.2 \n", "3 life 830.1 \n", "4 peopl 798.2 \n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will visualize these results to understand what major themes are present in them. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "\n", "
Made with Flourish
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
Made with Flourish
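" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick textual check (a sketch added for illustration; it assumes only the `df` loaded above), we can also print each topic's top words directly:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: list each topic's top five words from the 'Topic N words'\n", "# columns of the topics DataFrame loaded above.\n", "word_columns = [c for c in df.columns if c.endswith('words')]\n", "for col in word_columns:\n", "    print(col, '->', ', '.join(df[col].head(5)))" ] }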
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### An Overview of the analysis \n", "From the above visualization, an anomaly that we come across is that the dataset we are examining is supposed to be related to people with physical, mental, and learning disabilities. But unfortunately, based on the topics that were extracted, we notice just a small subset of words that are related to this topic. \n", "Topic 2 has words that address themes related to what we were expecting the dataset to have. But the major theme that was noticed in the Top 5 topics are main terms that are political. \n", "(The Top 10 topics show themes related to Religion as well, which is quite interesting.)\n", "LDA hence helped in understanding what the conversations the dataset consisted. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the word collection, we also notice that there were certain words such as \\'kill' that can be categorized as \\'Toxic'\\. To analyze this more, we can classify each word because it can be categorized wi by an NLP classifier. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To demonstrate an example of a toxic analysis framework, the below code shows the working of the Unitary library in python. {cite}`Detoxify`\n", "\n", "This library provides a toxicity score (from a scale of 0 to 1) for the sentence that is passed through it." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "headers = {\"Authorization\": f\"Bearer api_ZtUEFtMRVhSLdyTNrRAmpxXgMAxZJpKLQb\"}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get access to this software, you will need to get an API KEY at https://huggingface.co/unitary/toxic-bert\n", "Here is an example of what this would look like.\n", "```python\n", "headers = {\"Authorization\": f\"Bearer api_XXXXXXXXXXXXXXXXXXXXXXXXXXX\"}\n", "```" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "import requests\n", "\n", "API_URL = \"https://api-inference.huggingface.co/models/unitary/toxic-bert\"\n", "\n", "def query(payload):\n", " response = requests.post(API_URL, headers=headers, json=payload)\n", " return response.json()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[{'label': 'toxic', 'score': 0.9272779822349548},\n", " {'label': 'severe_toxic', 'score': 0.00169223896227777},\n", " {'label': 'obscene', 'score': 0.03694247826933861},\n", " {'label': 'threat', 'score': 0.0017220545560121536},\n", " {'label': 'insult', 'score': 0.02829463966190815},\n", " {'label': 'identity_hate', 'score': 0.004070617724210024}]]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query({\"inputs\": \"addict\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can input words or sentences in \\, in the code, to look at the results that are generated through this.\n", "\n", "This example can provide an idea as to how ML can be used for toxicity analysis." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[{'label': 'toxic', 'score': 0.5101907849311829},\n", " {'label': 'severe_toxic', 'score': 0.07646821439266205},\n", " {'label': 'obscene', 'score': 0.12113521993160248},\n", " {'label': 'threat', 'score': 0.07763686031103134},\n", " {'label': 'insult', 'score': 0.11923719942569733},\n", " {'label': 'identity_hate', 'score': 0.09533172845840454}]]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query({\"inputs\": \"\"})" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "\n", "
Made with Flourish
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
Made with Flourish
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The Bias\n", "The visualization shows how contextually toxic words are derived as important words within various topics related to this dataset. These toxic words can lead to any Natural Language Processing kernel learning this dataset to provide skewed analysis for the population in consideration, i.e., people with mental, physical, and learning disabilities. This can lead to very discriminatory classifications. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### An Example\n", "To illustrate the impact better, we will be taking the most associated words to the word 'mental' from the results. Below is a network graph that shows the commonly associated words. It is seen that words such as 'Kill' and 'Gun' appear with the closest association. This can lead to the machine contextualizing the word 'mental' to be associated with such words. " ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
, { "cell_type": "code", "execution_count": 17, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "Made with Flourish
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is hence essential to be aware of the dataset that is being used to analyze a specific population. With LDA, we were able to understand that this dataset cannot be used as a good representation of the disabled community. To bring about a movement of unbiased AI, we need to perform such preliminary analysis and more not to cause unintended discrimination. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Dashboard\n", "\n", "Below is the complete data visualization dashboard of the topic analysis. Feel feel to experiment and compare various labels to your liking. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Thank you!\n", "\n", "We thank you for your time! " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }